{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 12 - Comparing distributions visually \n", "\n", "This lab will use two days worth of data from the 311 Service Request dataset on NYC Open Data. \n", "\n", "We will download the 311 data from October 6 and 7, 2019:\n", "1. Go to https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9 and click \"View Data\". \n", "2. We will filter the data to only contain complaints made on March 3 or 4, 2019: \n", " * If necessary, click on \"Filter\", and then click on \"Add new filter condition\". \n", " * Select the column \"Created Date\" and change \"is\" to \"is between\". \n", " * For the first date, select 10/06/2019 12:00:00 AM \n", " * For the second date, select 10/07/2019 12:00:00 AM \n", " * Check the box to left of the first date. The data should change, so that only those complaints created on October 6 or 7, 2019 show. \n", "3. To download the filtered data, click Export, then CSV. \n", "4. If necessary, rename the file so it will be named differently than the previous 311 data file.\n", "5. Upload your new data file to Jupyter Hub.\n", "\n", "As usual we import the matplotlib and pandas packages and set plots to appear in Jupyter notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load your new 311 data file into the dataframe `calls`. Use the `parse_dates` parameter to store the `Created Date` column as a `DatetimeIndex` type instead of string (see Lab 2 if you forget how to do this). Recall that this parameter tells Pandas that the `Created Date` column in actually a date/time and not just some random text. Parsing the date takes some extra time, so we have been skipping this step if we are not using the date." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", " df_name = pd.read_csv(\"data_file.csv\",parse_dates=[\"column_to_read_as_date\"])\n", "
\n", "\n", "Display `calls` to check that it was created correctly. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing complaint times on Sunday vs. Monday\n", "\n", "First we are going to compare the times complaints were made on Sunday October 6 with the times complaints were made on Monday October 7. We will use histograms plotted on top of each other to visualize and compare the two distributions. Before starting, how do you think timing of complaints will differ between Sunday and Monday?\n", "\n", "First we will create a filter to find all complaints created on October 6. Type `oct6_filter = calls[\"Created Date\"].dt.day ==6` below and run the code. The `.dt.day` gets just the day from the `DatetimeIndex` type and `==6` checks if it is equal to 6. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use your October 6 `oct6_filter` to create a new dataframe called `oct6_calls` containing just the complaints that match this filter. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", " new_df_name = df[filter]\n", "
\n", "\n", "Now create a dataframe called `oct7_calls` containing only the calls from October 7. That is, create the filter and then use it to select from the `calls` dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now plot a histogram of the hours of the October 6 calls. We can get a Series of the hour each October 6 call was created with `oct6_calls[\"Created Date\"].dt.hour`. Type and run this below to check this." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So to make a histogram of the hours, type `oct6_calls[\"Created Date\"].dt.hour.hist()` below and run it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you notice about the distribution? You may want to increase the number of bins to 24.\n", "\n", "To plot two histograms on the same graph, put the code to plot one histogram on the first line of a cell, and the code to plot the other histogram on the second line of the same cell. \n", "\n", "Replot the histogram of the hours the October 6 complaints were created below and along with the histogram of the hours of the October 7 complaints." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first histogram (October 6) is in blue and the second (October 7) in orange. Can you see all of both histograms? What do you notice?\n", " \n", "Let's make this plot look nicer. To make a histogram transparent, add the parameter `alpha = 0.5`. You can also change the color of a histogram to red by adding the parameter `color = \"red\"`. A list of the possible colors is [here](https://matplotlib.org/examples/color/named_colors.html). \n", "\n", "Can you add axis labels and a title? \n", "\n", "Finally, we can add a legend with `plt.legend([\"Sunday October 6\",\"Monday October 7\"])` (This legend assumes the first histogram is for October 6 and the second is for October 7.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you notice? What differences are there between the distributions of the times of the 311 complaints on Sunday October 6 and Monday October 7?\n", "\n", "It looks like there are more complaints on October 7 overall. So instead of comparing the absolute number of complaints at each time, it might be better to compare the proportion of complaints at each time. Do this by adding the parameter `density = True` to each histogram function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing complaint distribution by borough on Monday vs. Sunday\n", "\n", "Now we will compare the distribution of complaints by borough on Sunday October 6 vs. Monday October 7. Do you think the distribution will change? If so, how?\n", "\n", "Above we created `oct6_calls` as a dataframe of only the October 6 calls and `oct7_calls` as a dataframe of only the October 7 calls.\n", "\n", "First, get the value counts of the `Borough` column in `oct6_calls` and in `oct7_calls` and store these in the variables `oct6_borough_counts` and `oct7_borough_counts` respectively." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", " counts_var = df[\"colunm_name\"].value_counts()\n", "
\n", "\n", "As with the histograms, we can plot two bar charts on top of each other by putting the two plotting functions in the same cell. Make an overlapping plot of the two bar charts of the borough counts for October 6 and for October 7 below. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "
Pattern:\n", " counts_var.plot(kind = \"bar\")\n", "
\n", "\n", "We can't see the difference between the two plots, so add a parameter to change the color of the October 6 one to blue and the color of the October 7 one to red (or whatever two colors you like). \n", "\n", "To further distinguish between the two plots, add the parameter `width = 0.75` to the plot function for October 6 and the parameter `width = 0.5` to the plot function for October 7." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think the `width` parameter does? What do you notice about the graph? Which day has more 311 calls?\n", "\n", "Since there are more 311 calls on Monday than on Sunday, it is hard to tell if the distribution between boroughs is different. We can normalize the counts so that they are proportions instead of counts by dividing by the total number of calls on each day. For example, to normalize the October 6 counts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "normalized_oct6_borough_counts = oct6_borough_counts/oct6_calls.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normalize the October 7 counts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's plot the normalized counts or proportions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you add a title, axis labels, and a legend?\n", "\n", "We can make all sorts of interesting comparisons with filters. Here is one comparing the type of calls made at night (6pm - 6am) with the type of calls made during the day. Try running the code below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "before_6am_filter = calls[\"Created Date\"].dt.hour < 6\n", "after_6pm_filter = calls[\"Created Date\"].dt.hour > 18\n", "\n", "night_calls = calls[before_6am_filter | after_6pm_filter]\n", "day_calls = calls[~before_6am_filter & ~after_6pm_filter]\n", "\n", "night_counts = night_calls[\"Complaint Type\"].value_counts()\n", "night_counts[night_counts >50].plot(kind = \"bar\",color = \"blue\",width = 0.75)\n", "day_counts = day_calls[\"Complaint Type\"].value_counts()\n", "day_counts[day_counts > 100].plot(kind = \"bar\",color = \"red\",width = 0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Challenges\n", "- Plot overlapping histograms of the hours that residential noise and commerical noise complaints are made. Residential noise complaints are listed as `Noise - Residential` under `Complaint Type` and commerical noise complaints are listed as `Noise - Commercial` under `Complaint Type`.\n", "- Plot overlapping bar charts of the borough distribution of no heat/hot water complaints (`HEAT/HOT WATER`) and another complaint of your choice. Is there a different in the distributions?\n", "- Your choice! Pick one variable to compare in two different situations. You can find a list of the different types of information you can get from the`DatetimeIndex` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }